Overview

I decided to start a journal to keep track of the work I am doing on applying LIME to the Hamby bullet data. I want a place to store my work for future reference and where I do not feel the need to “perfect” code. I am hoping that this will be a place where I can write drafts of code and create first versions of graphics. Then I can clean up and organize the code when I write the papers.

Prior to this journal, I had been working on all of my ideas in the R Markdown document for the firearm examiner paper. I am transferring a lot of that code here now, and later I will move the necessary parts back to the paper.

# Load libraries
library(bulletr)
library(caret)
library(cowplot)
library(furrr)
library(future)
library(gretchenalbrecht)
library(irr)
library(lime)
library(plotly)
library(purrr)
library(randomForest)
library(tidyverse)

# Source functions
source("../code/helper_functions.R")

# Load in the training data (Hamby Data 173 and 252) and testing data (Hamby Data 224 Sets 1 and 11)
hamby173and252_train <- read.csv("../data/hamby173and252_train.csv")
hamby224_test <- read.csv("../data/hamby224_test.csv") # created in the paper

# Read in the test_explain data
hamby224_test_explain <- readRDS("../data/hamby224_test_explain.rds")

# Obtain features used when fitting the rtrees random forest
rf_features <- rownames(rtrees$importance)

Plots of Features

Distributions by samesource

I created the following two plots of the training data as suggested by Heike. The histograms below show the distributions of the features used in the random forest rtrees. The histograms are filled by the samesource variable, which indicates whether or not the comparison truly comes from the same barrel and land. The default histograms make it hard to compare the distributions of the matches and non-matches since there are many more comparisons with samesource == FALSE.

# Create plots of the feature distributions colored by samesource for the training data
hamby173and252_train %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(bins = 30) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Frequency", fill = "Same Source?",
       title = "Hamby 173 and 252 Training Data") +
  theme_bw() +
  theme(legend.position = "bottom")

By setting position = "fill" in the geom_histogram function, it is easier to compare the matches and non-matches. These plots could be used in the future to hand select the bins for lime. Additionally, a logistic regression fit to these data could be used to determine the LC50, LC10, and LC90, which could serve as the bins for lime.

# Create plots of the feature distributions colored by samesource for the training data
# using the position = "fill" option
hamby173and252_train %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(position = "fill", bins = 30) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Proportion", fill = "Same Source?",
       title = "Hamby 173 and 252 Training Data") +
  theme_bw() +
  theme(legend.position = "bottom")
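As a concrete sketch of the logistic regression idea mentioned above: fit P(samesource = TRUE) as a function of a single feature and invert the fitted logit to find the feature values where the predicted probability is 0.10, 0.50, and 0.90. This is only an illustration (simulated data stands in for the training data here), not code used in the analysis.

```r
# Simulated stand-in for one feature (ccf) and the samesource response
set.seed(20181101)
fake_train <- data.frame(ccf = runif(1000))
fake_train$samesource <- rbinom(1000, 1, plogis(10 * (fake_train$ccf - 0.5)))

# Fit the logistic regression
fit <- glm(samesource ~ ccf, data = fake_train, family = binomial)
b0 <- coef(fit)[["(Intercept)"]]
b1 <- coef(fit)[["ccf"]]

# Solve plogis(b0 + b1 * x) = p for x to get the LC values
lc <- function(p) (qlogis(p) - b0) / b1
lc(c(0.1, 0.5, 0.9))  # candidate bin cutoffs (LC10, LC50, LC90)
```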

The plots below have the same structure, but they are created with the Hamby 224 testing data. I chose to not separate the testing data by sets, but this is something that could be done later if necessary.

# Create plots of the feature distributions colored by samesource for the testing data
hamby224_test %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(bins = 15) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Frequency", fill = "Same Source?",
       title = "Hamby 224 Testing Data") +
  theme_bw() +
  theme(legend.position = "bottom")

# Create plots of the feature distributions colored by samesource for the testing 
# data using the position = "fill" option
hamby224_test %>% 
  select(rf_features, samesource) %>%
  gather(key = feature, value = value, 1:9) %>%
  select(feature, value, samesource) %>%
  ggplot(aes(x = value, fill = samesource)) + 
  geom_histogram(position = "fill", bins = 15) + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Proportion", fill = "Same Source?",
       title = "Hamby 224 Testing Data") +
  theme_bw() +
  theme(legend.position = "bottom")

Correlations

I made these plots to look at the correlations between features in the training data within the TRUE and FALSE cases of samesource. The variables are clearly more correlated for the match comparisons than for the non-match comparisons, though some variables still have relatively high correlations for the non-match comparisons.

# Create a correlation heatmap of the match comparisons in the training data
cor_match <- hamby173and252_train %>%
  select(rf_features, samesource) %>%
  filter(samesource == TRUE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") + 
  labs(x = "", y = "", fill = "Correlation",
       title = "Match Comparisons")

# Create a correlation heatmap of the non-match comparisons in the training data
cor_nonmatch <- hamby173and252_train %>%
  select(rf_features, samesource) %>%
  filter(samesource == FALSE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(x = "", y = "", fill = "Correlation",
       title = "Non-Match Comparisons")

# Create a title for the panel of plots
cor_title <- ggdraw() +
  draw_label("Correlation of Feature Variables in the Training Data", 
             fontface = "bold", 
             size = 16,
             x = 0.5,
             hjust = 0.5)

# Create a joined legend for the panel of plots
cor_legend <- get_legend(cor_match + theme(legend.position = "right"))

# Create the panel of plots
plot_grid(cor_title, 
          plot_grid(cor_match, cor_nonmatch, cor_legend, ncol = 3, rel_widths = c(2, 2, 0.5)),
          ncol = 1,
          rel_heights = c(0.25, 3))

The plots below show the correlations for the testing data. The patterns look very similar to those of the training data.

# Create a correlation heatmap of the match comparisons in the testing data
cor_match_test <- hamby224_test %>%
  select(rf_features, samesource) %>%
  filter(samesource == TRUE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") + 
  labs(x = "", y = "", fill = "Correlation",
       title = "Match Comparisons")

# Create a correlation heatmap of the non-match comparisons in the testing data
cor_nonmatch_test <- hamby224_test %>%
  select(rf_features, samesource) %>%
  filter(samesource == FALSE) %>%
  select(-samesource) %>%
  cor() %>%
  reshape2::melt() %>%
  mutate(Var1 = factor(Var1, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms")),
         Var2 = factor(Var2, levels = c("ccf", "cms", "matches", "rough_cor", "sum_peaks",
                                        "D", "sd_D", "mismatches", "non_cms"))) %>%
  ggplot(aes(x = Var1, y = Var2, fill = value)) + 
  geom_tile() + 
  scale_fill_gradient2(limits = c(-1, 1)) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "none") +
  labs(x = "", y = "", fill = "Correlation",
       title = "Non-Match Comparisons")

# Create a title for the panel of plots
cor_title_test <- ggdraw() +
  draw_label("Correlation of Feature Variables in the Testing Data", 
             fontface = "bold", 
             size = 16,
             x = 0.5,
             hjust = 0.5)

# Create a joined legend for the panel of plots
cor_legend_test <- get_legend(cor_match_test + theme(legend.position = "right"))

# Create the panel of plots
plot_grid(cor_title_test, 
          plot_grid(cor_match_test, cor_nonmatch_test, cor_legend_test, 
                    ncol = 3, rel_widths = c(2, 2, 0.5)),
          ncol = 1,
          rel_heights = c(0.25, 3))

Visualizations of Explanations

Heike made a plot with the structure of the one below when we began with the default settings of lime. This one has been created from the lime explanations with 2 quantile bins. It includes all three of the chosen features for each case in the testing dataset. For both sets, the cutoff of 0.275 < ccf occurs the most frequently.

# Plot of frequency for each bin division for 2 quantile bins
hamby224_test_explain %>%
  filter(situation == "2 quantile") %>%
  ggplot(aes(x = feature_desc)) +
  geom_bar() +
  coord_flip() + 
  facet_grid(set ~ .)

The plot below is a prototype of the one that will be used in the app for exploring the lime explanations from the bullet matching data. It shows the data from set 1 of the testing dataset.

# Heatmap of rfscore for each comparison in set 1
hamby224_test %>%
  filter(set == "Set 1") %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2,
             text = paste('Bullets Compared: ', bullet1, "-", land1, 
                          "vs", bullet2, "-", land2,
                          '\nRandom Forest Score: ', 
                          ifelse(is.na(rfscore), "Missing due to tank rash", rfscore)))) +
  geom_tile(aes(fill = rfscore)) +
  facet_grid(bullet2 ~ bullet1, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "darkgrey", high = "darkorange", midpoint = 0.5) +
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        panel.grid.major = element_blank(), 
        panel.grid.minor = element_blank()) +
  labs(x = "", y = "", fill = "RF Score")

I also wrote some code to create the above plot using individual plots instead of facets. This is included below. (It is a bit outdated compared to my newer version of the plot.)

# Code for creating the heatmap plots using individual plots
p1 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 1", bullet2 == "Known 1") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p2 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 1", bullet2 == "Known 2") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p3 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 1", bullet2 == "Questioned") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p4 <- ggplot() + geom_blank() + theme_classic()

p5 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 2", bullet2 == "Known 2") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p6 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Known 2", bullet2 == "Questioned") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score") + 
  theme(legend.position = "none")

p7 <- ggplot() + geom_blank() + theme_classic()

p8 <- ggplot() + geom_blank() + theme_classic()

p9 <- hamby224_test %>%
  filter(set == "Set 1", bullet1 == "Questioned", bullet2 == "Questioned") %>%
  select(case, bullet1, bullet2, land1, land2, rfscore) %>%
  distinct() %>%
  ggplot(aes(x = land1, y = land2, label = bullet1, label2 = bullet2)) +
  geom_tile(aes(fill = rfscore)) +
  #facet_grid(bullet1 ~ bullet2, scales = "free") +
  theme_minimal() +
  scale_fill_gradient2(low = "grey", high = "orange", midpoint = 0.5, limits = c(0,1)) +
  labs(x = "Land 1", y = "Land 2", fill = "RF Score")

style(subplot(p1, p2, p3, p4, p5, p6, p7, p8, p9, 
              nrows = 3, 
              titleX = TRUE, 
              titleY = TRUE, 
              margin = 0.03),
      hoverinfo = "skip",
      traces = 7)

Draft of LIME Procedure

I wrote this up to start thinking about how to describe the procedure that the lime R package uses to implement the LIME algorithm. It needs a lot of work, but it is a start. The final version of this will end up in the technical stats paper critiquing LIME. The version in the firearm examiner’s paper will be much simpler…

The steps below explain the procedure that the lime R package uses to apply the LIME algorithm to the bullet matching predictions on the Hamby 224 clone dataset made by the random forest model from Hare. For simplicity, the steps describe what happens to one case in the test data. Steps (2) through (7) are repeated for each observation in the testing dataset.

Let \[Y_{jk} = \begin{cases} 1 & \mbox{ if bullets } j \mbox{ and } k \mbox{ were fired from the same gun barrel }\\ 0 & \mbox{otherwise} \end{cases}\] be the response variable in the training dataset, and \(X_1,...,X_9\) correspond to the nine features in the training dataset. Let \(X'_1,...,X'_9\) be the corresponding feature values for the test case being explained.

  1. Distributions for each of the features in the training data are obtained.
    • The method that lime uses to obtain the distribution differs based on the feature type. All of the features in the Hamby datasets are numeric. For numeric features, the default option in lime (quantile_bins = TRUE) computes the quantiles of each feature based on the number of bins selected. The default number of bins is 4 (n_bins = 4).
  2. Many (\(n\)) samples from each of the feature distributions are drawn.
    • To do this, lime has several options (mostly quoted from lime package for now):
    • bin_continuous = TRUE should continuous variables be binned?
    • quantile_bins = TRUE should the bins for n_bins be based on quantiles or spread evenly
    • n_bins = 4 number of bins if bin_continuous is TRUE
    • use_density = TRUE if bin_continuous is FALSE, should continuous data be sampled using kernel density estimation (if not, then will assume normal for continuous variable)
  3. Predictions for the testing data using the random forest model are computed.
    • The random forest model rtrees is used to make a prediction for the observation from the test dataset and each of the \(n=5000\) samples as to whether or not the two bullets in the test case are a match. Since the random forest is a classification model, lime is set to return the prediction probabilities.
  4. Similarity scores between the observation in the testing data and each of the \(n=5000\) sampled values are obtained.
    • The way that the similarity score is computed depends on the type of feature. Since all of the features in the Hamby 224 test dataset are continuous, the simulated values are first converted into 0-1 features where a 1 indicates that the feature from the simulated value falls in the same bin as the observed data point and a 0 indicates that it does not. Then, by default, the Gower distance (computed using the gower package in R) is used to compute the similarity score.
  5. Feature selection is performed by fitting some type of regression model, weighted by the similarity scores, to the simulated data and the observed value. The 0-1 versions of the features are used.
    • The user can specify the number of features, \(m\), they would like to select to explain the prediction. lime supports the following options for feature selection:
    • forward selection with ridge regression
    • highest weight with ridge regression
    • LASSO model
    • tree model
    • default: forward selection if \(m\le6\) with a ridge regression model, highest weight with ridge regression otherwise
  6. A ridge regression model is fit as the simple model by regressing the prediction probabilities on the \(m\) selected predictor variables, weighted by the similarity scores. If the response is categorical, the user can select how many categories and which categories they want to explain. \[P(Match = TRUE) = \beta_0 + \beta_1 \cdot I\left[X_1 \in \mbox{obs bin}\right] + \beta_2 \cdot I\left[X_2 \in \mbox{obs bin}\right] + \beta_3 \cdot I\left[X_3 \in \mbox{obs bin}\right]\] For the prediction of interest, \[P(Match = TRUE) = \beta_0 + \beta_1 + \beta_2 + \beta_3.\]
  7. The feature weights are extracted and used as the explanations.

Note: I realized that if bin_continuous = FALSE, then bins are not used at all. Instead, a kernel density estimator is used to sample from the distribution (or a normal distribution if specified), and then the ridge regression models are fit without “numerified” values.
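The steps above can be sketched end to end on simulated data. This is only a toy illustration, not the package's code: three fake features stand in for the nine real ones, a weighted lm stands in for the ridge regression, and a crude "same bin" proportion stands in for the Gower based similarity.

```r
set.seed(2018)
train <- data.frame(x1 = rnorm(500), x2 = rnorm(500), x3 = rnorm(500))
obs   <- data.frame(x1 = 0.2, x2 = -1, x3 = 1.5)      # test case to explain
f     <- function(d) plogis(2 * d$x1 - d$x2 + d$x3)   # stand-in "complex model"

# (1) Quantile bins from the training data (n_bins = 4)
cuts <- lapply(train, quantile, probs = c(0.25, 0.5, 0.75))

# (2) Draw n = 5000 samples: pick a bin, then a uniform value within it
sample_feature <- function(x, q, n) {
  edges <- c(min(x), unname(q), max(x))
  idx <- sample(4, n, replace = TRUE)
  runif(n, edges[idx], edges[idx + 1])
}
sim <- as.data.frame(lapply(names(train), function(v)
  sample_feature(train[[v]], cuts[[v]], 5000)))
names(sim) <- names(train)

# (3) Complex model predictions for the simulated points
p_hat <- f(sim)

# (4) 0-1 features: is the simulated value in the same bin as the test case?
same_bin <- function(x, x0, q) as.numeric(findInterval(x, q) == findInterval(x0, q))
z <- sapply(names(train), function(v) same_bin(sim[[v]], obs[[v]], cuts[[v]]))
w <- rowMeans(z)                                      # crude similarity score

# (5)-(6) Weighted fit of the predictions on the indicator features
simple <- lm(p_hat ~ z, weights = w)
coef(simple)                                          # (7) the "explanations"
```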

Notation

Step by Step Procedure

Step 1:

Computation Issues

Plotly Dead Space Issue

I found an issue with plotly. Below is the reproducible example that I sent to Carson. He was able to fix the issue, so this example no longer shows the problem.

# Example with Plotly to show dead space in heatmap

# System information
# R version 3.5.1 (2018-07-02)
# Platform: x86_64-apple-darwin15.6.0 (64-bit)
# Running under: macOS  10.14

# Load libraries
# library(plotly) # version 4.8.0
# library(ggplot2) # version 3.1.0
# library(tidyr) # version 0.8.2

# Example from: https://plot.ly/r/heatmaps/ -------------------------------------------

# Create dataset
m <- matrix(rnorm(9), nrow = 3, ncol = 3)

# Create plotly heatmap - no dead space to be found
plot_ly(x = c("a", "b", "c"), y = c("d", "e", "f"), z = m, type = "heatmap")

# Example using ggplotly function -----------------------------------------------------

# Reshape the data for ggplot
m_gathered <- data.frame(m) %>%
  gather(key = column) %>%
  mutate(row = factor(rep(c("X1", "X2", "X3"), 3))) %>%
  select(column, row, value)

# Create ggplot heatmap
p <- ggplot(m_gathered, aes(x = column, y = row, fill = value)) +
  geom_tile()

# Apply plotly to ggplot heatmap - dead space in the middle of (X1, X1)
ggplotly(p)

# Carson's suggested fix for now
style(ggplotly(p), hoverinfo = "skip", traces = 2)

# Create ggplot heatmap without a legend
p_nolegend <- ggplot(m_gathered, aes(x = column, y = row, fill = value)) +
  geom_tile() +
  theme(legend.position = "none") 

# Apply plotly to ggplot heatmap - the dead space is gone!
ggplotly(p_nolegend)

Caret vs RandomForest

The lime function in lime is set up to work with specific packages. For example, lime works with a random forest model fit using the caret package, but it is not set up to work with a random forest fit using the randomForest package. I found a suggestion to apply the function as_classifier from the lime package to a model fit using randomForest in order for the lime function to accept the model. It seemed to work. However, I wanted to compare the lime results from a model fit using caret and the same model fit using randomForest. To do this, I used the iris data. The code below goes through the process of fitting the two models (rf_model and caret_model). Then the lime and explain functions are applied to both models.

# Code for comparing the output from LIME when the model is 
# fit with caret and randomForest
# Last Updated: 2018/11/13

## Set up -----------------------------------------------------------------------------

# Split up the data set
iris_test <- iris[1:5, 1:4]
iris_train <- iris[-(1:5), 1:4]
iris_lab <- iris[[5]][-(1:5)]

## LIME with caret --------------------------------------------------------------------

# Create Random Forest model on iris data
caret_model <- train(iris_train, iris_lab, method = 'rf')

# Create an explainer object
caret_explainer <- lime::lime(iris_train, caret_model)

# Explain new observation
caret_explanation <- lime::explain(iris_test, caret_explainer, n_labels = 1, n_features = 4)

## LIME with randomForest -------------------------------------------------------------

rf_model <- randomForest(iris_train, iris_lab)

# Create an explainer object
rf_explainer <- lime::lime(iris_train, model = as_classifier(rf_model))

# Explain new observation
rf_explanation <- lime::explain(iris_test, rf_explainer, n_labels = 1, n_features = 4)

To compare the lime explanations from the two models, I extracted the \(R^2\) value from the simple model, the simple model intercept, the simple model prediction, the feature values, and the feature weights from both explanation datasets. I computed the MSE between each of these values from the two models, and I plotted them on a scatterplot. Since lime is based on random permutations, I would not expect the values from the two models to match exactly. However, I would like them to be close. The MSEs are all close to zero, and the plots suggest that the values do an okay job of following the 1-1 line. We decided that the two versions of the explanations are reasonably exchangeable.

## Comparisons -----------------------------------------------------------------------

# Grab the numeric caret explanation variables of interest
caret_numeric <- caret_explanation %>%
  select(model_r2, model_intercept, model_prediction, feature_value, feature_weight) %>%
  gather(key = "variable", value = "caret_value")

# Grab the numeric randomForest explanation variables of interest
rf_numeric <- rf_explanation %>%
  select(model_r2, model_intercept, model_prediction, feature_value, feature_weight) %>%
  gather(key = "variable", value = "rf_value")

# Join the two
lime_numeric <- caret_numeric %>%
  mutate(rf_value = rf_numeric$rf_value, 
         variable = factor(variable))

# Look at the MSEs for the variables
lime_numeric %>%
  group_by(variable) %>%
  summarise(MSE = sum((caret_value - rf_value)^2) / dim(lime_numeric)[1])
## # A tibble: 5 x 2
##   variable                MSE
##   <fct>                 <dbl>
## 1 feature_value    0         
## 2 feature_weight   0.000487  
## 3 model_intercept  0.00000795
## 4 model_prediction 0.000101  
## 5 model_r2         0.0000134

# Scatterplots of the randomForest versus caret variable values
ggplot(lime_numeric, aes(x = caret_value, y = rf_value)) + 
  geom_point() + 
  facet_wrap( ~ variable, scales = "free") + 
  geom_abline(intercept = 0, slope = 1)

Comparing Furrr Times

The amount of time it took me to run the explain function with all of my input options was getting pretty long. Heike suggested that I use the furrr package, which implements the purrr functions with the parallel processing capabilities of the future package. I ran and timed the code below to see how much faster it was. Using the multiprocess option, the code took about half the amount of time to run (from 218.188 seconds to 107.731 seconds)!

# It took about half the time when using the function from furrr!

library(tictoc)
library(future)
library(furrr)

# Slow way
plan(sequential)
tictoc::tic()
sensitivity_explain <- future_pmap(.l = as.list(sensitivity_inputs %>% 
                                                         select(-case)),
                                          .f = run_lime, # run_lime is one of my helper functions
                                          train = hamby173and252_train %>% select(rf_features),
                                          test = hamby224_test %>% arrange(case) %>% select(rf_features) %>% na.omit(),
                                          rfmodel = as_classifier(rtrees),
                                          label = "TRUE",
                                          nfeatures = 3,
                                          seed = FALSE)
tictoc::toc()
# 218.188 sec elapsed

# Fast way
plan(multiprocess)
tictoc::tic()
sensitivity_explain <- future_pmap(.l = as.list(sensitivity_inputs %>% 
                                                         select(-case)),
                                          .f = run_lime, # run_lime is one of my helper functions
                                          train = hamby173and252_train %>% select(rf_features),
                                          test = hamby224_test %>% arrange(case) %>% select(rf_features) %>% na.omit(),
                                          rfmodel = as_classifier(rtrees),
                                          label = "TRUE",
                                          nfeatures = 3,
                                          seed = FALSE)
tictoc::toc()
# 107.731 sec elapsed

New Binning Methods

Subsampled Bins

One idea that I had to try to improve the lime explanations was to subsample from the training dataset before computing the quantiles. The proportion of observations with samesource = FALSE is very large in the training dataset, so those observations overwhelm the ones with samesource = TRUE. Subsampling might allow the bins to fall at better locations. I wrote the code below to attempt to apply lime to a subsampled version of the training dataset.

# Code saved for future use for the sampling data

# Create the data (with a set seed)
set.seed(20181128)
hamby173and252_train_sub <- hamby173and252_train %>%
  filter(samesource == FALSE) %>%
  slice(sample(1:n(), size = 4500, replace = FALSE)) %>%
  bind_rows(hamby173and252_train %>% filter(samesource == TRUE))

# Save the subsampled data
write.csv(hamby173and252_train_sub, "../data/hamby173and252_train_sub.csv", row.names = FALSE)

# Apply lime to the subsampled training data with the specified input options
hamby224_lime_explain_sub <- future_pmap(.l = as.list(hamby224_lime_inputs %>%
                                                        select(-case)),
                                         .f = run_lime, # run_lime is one of my helper functions
                                         train = hamby173and252_train_sub %>% select(rf_features),
                                         test = hamby224_test %>% arrange(case) %>% select(rf_features) %>% na.omit(),
                                         rfmodel = as_classifier(rtrees),
                                         label = "TRUE",
                                         nfeatures = 3,
                                         seed = TRUE)

# Separate the lime and explain function results from the subsampled data and
# label the explanations as coming from the subsampled training data
hamby224_lime_sub <- map(hamby224_lime_explain_sub, function(list) list$lime)
hamby224_explain_sub <- map_df(hamby224_lime_explain_sub, function(list) list$explain) %>%
  mutate(training_data = factor("Subsampled"))

# Join the lime results from the full and subsampled training data
hamby224_lime <- c(hamby224_lime_full, hamby224_lime_sub)
hamby224_explain <- bind_rows(hamby224_explain_full, hamby224_explain_sub)

Tree Based Bins

Heike suggested the idea of using classification trees to choose the cutoffs for the lime bins. That is, the divisions in the tree could be used as the bin cuts. This would allow us to automate the process and to obtain a penalty parameter, since the trees give us nesting. (MSE + lambda * p where p is from the number of trees multiplied by the number of …)

Heike wrote the function treebink to take x and y variables for fitting a tree and to return the cuts for \(k\) bins based on the splits from the tree. The function is stored in the file helper_functions.R. Below is an example using the function with the predictor variable ccf.

# Example using treebink
treebink(y = hamby173and252_train$samesource, x = hamby173and252_train$ccf, k = 5)
## [1] 0.493562 0.588575 0.688211 0.778397

However, there seems to be a problem with treebink. For certain variables, when a high number of bins is requested, the function returns more than the desired number of bins. Heike is going to look into this.
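For intuition, the underlying mechanism (extracting a tree's split points as cut locations) can be sketched with rpart on simulated data. This is only an illustration of the idea, not treebink itself, and it does not handle pruning the tree back to exactly \(k\) bins.

```r
# Grow a tree of a fake samesource on a fake ccf and use its split points
# as candidate bin boundaries
library(rpart)
set.seed(20181204)
ccf <- runif(1000)
samesource <- factor(rbinom(1000, 1, plogis(12 * (ccf - 0.5))))
fit <- rpart(samesource ~ ccf, control = rpart.control(cp = 0.001))
sort(unique(fit$splits[, "index"]))  # cut locations chosen by the tree
```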

She recently updated the function treebink to run using the tree package. Here are some comments from her about the code:

“I’ve moved it from rpart to tree. I am not quite sure that the two packages are doing the exact same thing. They claim they do but they might be implemented a bit differently. What we are doing now is if there are too many splits, we look at the order in which the splits are made and roll back some of them. I haven’t found a case that doesn’t work, but obviously that doesn’t show that there isn’t any :)”

I wrote the function mylime, which adds an option to the lime function from the lime package to use Heike’s function treebink to obtain tree based bins. This function is also included in the file helper_functions.R. The output from mylime can then be input into the explain function to obtain explanations. I was able to rerun all of the code for creating the hamby224_test_explain dataset with the explanations from the tree based bins included for 2 to 6 bins.

Assessing Lime Explanations

When we began looking at the explanations from lime with the default settings, we did not think that they made sense. This led me to try applying LIME with a handful of input values. I also tried running LIME with the tree based bins. Additionally, since LIME is based on random permutations, I was curious how consistent the results were, so I tried running each implementation a handful of times. I have attempted to evaluate these results by considering both their accuracy and their consistency.

Accuracy

The figures in this section are created from the implementations of LIME on set 1 from the training data for the input options of 2 to 6 quantile bins, 2 to 6 equally spaced bins, 2 to 6 tree based bins, kernel density estimation, and normal distribution approximation. Each set of input values was only run once.

# Read in the lime comparison data (the comparisons are run in the firearm examiner
# paper since they will be discussed in the paper)
hamby224_lime_comparisons <- readRDS("../data/hamby224_lime_comparisons.rds")

Complex versus Simple Model Predictions

The plot below compares the predictions from the "simple model" (the ridge regression model) and the "complex model" (the random forest rtrees) from the lime implementations with bin estimation (bin_continuous == TRUE). The simple model predictions are on the y-axis, and the complex model predictions are on the x-axis. The plot is faceted by number of bins and the type of bin. The points are colored by the \(R^2\) value from the fit of the simple model. The lines are linear regression lines fit to the data points within a facet. We would hope that there is a linear relationship between these two variables. None of the cases show strong linear trends, but some are more linear than others. The quantile bins show that the simple model never makes a prediction over 0.6, whereas the random forest model can have predictions of up to 1. The equally spaced bins do have probabilities that exceed 0.6, but only with 3 and 6 bins. The tree based bins do the best job, with predictions above 0.6 for all numbers of bins. The tree bins also seem to group the simple model predictions into two categories (probably based on the samesource variable). I noticed that within the facets, the points fall in mostly horizontal strips, and the number of strips is about the same as the number of bins from the lime implementation.

# Plot of simple model versus complex model predictions
hamby224_lime_comparisons %>%
  filter(!is.na(rfscore), set == "Set 1", bin_continuous == TRUE) %>%
  ggplot(aes(x = rfscore, y = model_prediction)) + 
  geom_point(aes(color = model_r2)) + 
  facet_grid(bin_situation ~ nbins) + 
  theme_bw() + 
  stat_smooth(method = "lm", se = FALSE, size = 0.5) + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  labs(x = "Complex Model Prediction (RF Score)", y = "Simple Model Prediction",
       color = "Simple Model R2")

The plot below shows the absolute value of the difference between the complex model prediction and the simple model prediction versus the complex model prediction. The points are faceted by number of bins and bin type. Again, the points are colored by the \(R^2\) values from the simple model. All cases show a v-shaped trend. The low part of the v occurs at a random forest score of about 0.25. That is, the simple model portrays the complex model predictions most accurately around a random forest score of 0.25 and does worse near the extremes. However, it is interesting to note that with the equally spaced bins and the tree based bins, the absolute difference decreases near 1. This is not the case with the quantile bins. It appears that the equally spaced bins and tree based bins are able to make slightly better predictions than the quantile bins when there is a high probability of a match.

# Plot of difference in predictions versus rfscore
hamby224_lime_comparisons %>%
  filter(!is.na(rfscore), set == "Set 1", bin_continuous == TRUE) %>%
  ggplot(aes(x = rfscore, y = abs(diff))) + 
  geom_point(aes(color = model_r2)) + 
  facet_grid(bin_situation ~ nbins) + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) +
  labs(x = "Complex Model Prediction (RF Score)", 
       y = "Difference Between Complex and Simple Model Predictions",
       color = "Simple Model R2")

At some point, I got the idea in my head that it would be interesting to look at the “residuals” (difference in complex and simple model predictions) by feature. The plot below shows one example with the data from 3 equally spaced bins. There are clear trends in the plots, but I am not quite sure what to make of this or how to use it. This may be something to return to.

# Plots of model "residuals"
hamby224_lime_comparisons %>%
  filter(quantile_bins == FALSE, bin_method == "equally_spaced", nbins == "3", set == "Set 1") %>%
  select(all_of(rf_features), diff) %>%
  gather(key = feature, value = feature_value, all_of(rf_features)) %>%
  select(feature, feature_value, diff) %>%
  ggplot(aes(x = feature_value, y = diff)) + 
  geom_point() + 
  facet_wrap(~ feature, scales = "free") + 
  geom_hline(yintercept = 0, color = "blue") + 
  labs(x = "Feature Value", y = "Residual")

Comparing Input Values by MSE and \(R^2\)

In order to assess which lime implementation is doing the best job of capturing the predictions from the random forest model, I decided to calculate the mean squared error for each of the input situations. I defined the mean squared error as \[\frac{\sum_{i=1}^n (\hat{p}_{\mbox{simple},i}-\hat{p}_{\mbox{complex},i})^2}{n}\] where \(n\) is the number of observations in the testing dataset (within a set). Additionally, I was curious to compare the fits of the ridge regression models across input values, so I calculated the average \(R^2\) for each input situation.
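As a toy check of the formula, here is the MSE computed for two short made-up prediction vectors:

```r
# Toy example of the MSE definition with made-up prediction vectors
simple  <- c(0.10, 0.40, 0.85) # simple model predictions
complex <- c(0.20, 0.35, 0.90) # complex model predictions
mean((simple - complex)^2)
## [1] 0.005
```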

# Summarizing lime results
hamby224_lime_results <- hamby224_lime_comparisons %>%
  filter(!is.na(rfscore)) %>%
  group_by(situation, bin_situation, bin_continuous, quantile_bins, nbins, use_density, 
           bin_method, response, set) %>%
  summarise(mse = (sum(diff^2)) / length(diff),
            ave_r2 = mean(model_r2)) %>%
  arrange(set) %>%
  ungroup()

The plot below shows the mean squared errors for each of the input situations faceted by set. In both cases, the mean squared error with 3 equally spaced bins is low, and the MSE for the tree binning cases are also low.

# Plot of MSEs
hamby224_lime_results %>%
  mutate(bins = ifelse(nbins == 4 & bin_continuous == FALSE, "other", nbins)) %>%
  ggplot(aes(x = bin_situation, y = mse, color = bin_situation)) +
  geom_point() + 
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1), 
        legend.position = "bottom") + 
  labs(x = "Bin Type", y = "MSE", color = "") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

The plot below shows the average \(R^2\) value for each of the input situations faceted by set. The results are very similar for each set, and the highest \(R^2\) values occur for 2 quantile bins. The \(R^2\) values for 3 equally spaced bins are in the middle, and the \(R^2\) values for all of the tree based bins are the lowest.

# Plot of R^2s
hamby224_lime_results %>%
  mutate(bins = ifelse(nbins == 4 & bin_continuous == FALSE, "other", nbins)) %>%
  ggplot(aes(x = bin_situation, y = ave_r2, color = bin_situation)) +
  geom_point() + 
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        legend.position = "bottom") + 
  labs(x = "Bin Type", y = "Average R2", color = "") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

Consistency with One Implementation

I wanted to look at the consistency of features chosen by lime across the number of bins within different bin types. I would like the features chosen to not depend on the number of bins used. Otherwise, it will be difficult to know how many bins to use. Additionally, it would be nice if the features chosen differed across cases, so that the explanations are based on the locality of the observation and not general across all locations. The code below creates a dataset that identifies the order in which the features are chosen by lime.

# Creates a dataset that creates columns of the chosen features by lime from first to third
chosen_features <- hamby224_test_explain %>%
  filter(!(bin_situation %in% c("kernel density", "normal approximation"))) %>%
  filter(!is.na(rfscore)) %>%
  select(set, situation, bin_situation, nbins, case, samesource, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, desc(feature_weight_abs)) %>%
  mutate(explainer = rep(c("first", "second", "third"), length(situation) / 3)) %>%
  select(-feature_weight, -feature_weight_abs) %>%
  spread(explainer, feature)

In order to assess consistency, I have decided to compute Fleiss' kappa for each of the bin types. Fleiss' kappa is often used to assess the consistency across raters for any number of raters and with nominal ratings. It calculates the degree of agreement in classification over that which would be expected by chance. A description of how it is computed can be found on the Wikipedia page: https://en.wikipedia.org/wiki/Fleiss%27_kappa
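As a small illustration of kappam.fleiss from the irr package, here it is applied to hypothetical ratings, where rows are subjects and columns are raters (here, made-up feature choices across repeated runs):

```r
library(irr)

# Hypothetical ratings: 4 comparisons (rows), each "rated" 3 times (columns)
# with the feature chosen as most important
ratings <- matrix(c("ccf",     "ccf",     "ccf",
                    "ccf",     "ccf",     "matches",
                    "matches", "matches", "matches",
                    "cms",     "ccf",     "cms"),
                  nrow = 4, byrow = TRUE)
kappam.fleiss(ratings)$value
```

In my analysis, the "raters" are the different numbers of bins, and the "subjects" are the test cases.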

I computed kappa for each bin situation separately for the top three features chosen. These values are plotted below for each set. For both the first and second features chosen by lime, the values of kappa suggest that quantile bins are the most consistent across all the number of bins followed by the tree based bins. The values of kappa for the quantile bins are slight to fair agreement. For the third chosen variable, the tree based bins have the highest value of kappa, but all the kappa values for the third chosen feature are low.

# Computes and plots the values of kappa
chosen_features %>%
  ungroup() %>%
  select(-situation) %>%
  gather(key = chosen, value = feature, first:third) %>%
  spread(nbins, feature) %>%
  select(-case) %>%
  rename(bins2 = "2", bins3 = "3", bins4 = "4", bins5 = "5", bins6 = "6") %>%
  group_by(set, bin_situation, chosen) %>%
  summarise(kappa = 
              irr::kappam.fleiss(matrix(c(bins2, bins3, bins4, bins5, bins6), 
                                        ncol = 5))$value) %>%
  arrange(chosen, set, bin_situation) %>%
  ggplot(aes(x = bin_situation, y = kappa, color = chosen, group = chosen)) +
  geom_point() + 
  geom_line() +
  facet_grid( ~ set) +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_color_gretchenalbrecht(palette = "pink_cloud", discrete = TRUE) +
  labs(x = "Bin Type", y = "Kappa", color = "Feature")

I also attempted to plot the features chosen by lime in order to visually assess the consistency. These plots are included below.

# Plots the top feature for each case for each bin situation
ggplot(chosen_features, aes(x = situation, y = case, fill = first)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Bin Type", y = "Case", fill = "Feature", title = "First Feature")

# Plots the second feature for each case for each bin situation
ggplot(chosen_features, aes(x = situation, y = case, fill = second)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Bin Type", y = "Case", fill = "Feature", title = "Second Feature")

# Plots the third feature for each case for each bin situation
ggplot(chosen_features, aes(x = situation, y = case, fill = third)) + 
  geom_tile() + 
  facet_grid(set + samesource ~ bin_situation, scales = "free", space = "free_y") + 
  theme_bw() + 
  theme(axis.text.x = element_text(angle = 45, hjust = 1),
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) + 
  labs(x = "Bin Type", y = "Case", fill = "Feature", title = "Third Feature")

I also created the plot below to visually assess, for each bin situation and feature level (first, second, or third chosen by lime), the proportion of cases with 1 to 5 distinct features chosen across the numbers of bins. The plots are separated by set.

# Plots the proportions of number of top features chosen across number of bins for a case
# within a binning method
chosen_features %>%
  group_by(set, case, bin_situation) %>%
  summarise(first = length(levels(factor(first))),
            second = length(levels(factor(second))),
            third = length(levels(factor(third)))) %>%
  gather(chosen, nfeatures, first:third) %>%
  ggplot(aes(x = bin_situation, fill = factor(nfeatures))) + 
  geom_bar(position = "stack") + 
  facet_grid(set ~ chosen) + 
  labs(x = "Bin Type", y = "Count", fill = "Number of Features") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") + 
  scale_fill_gretchenalbrecht(palette = "winter_light", discrete = TRUE)

Consistency Across Implementations

In order to assess how variable the LIME results are, I ran each of the input situations 10 times. I wanted to look at the variability of the MSE for each situation, and I wanted to see how consistent the choice of selected features is by LIME.

# Perform the sensitivity analysis if not already saved
if(!file.exists("../data/hamby224_sensitivity_inputs.rds")) {
  
  # Specify the number of reps and input cases
  nreps = 10
  noptions = 22
  
  # Specify the inputs for the sensitivity analysis
  hamby224_sensitivity_inputs <- list(bin_continuous = c(rep(TRUE, nreps * 20),
                                                         rep(FALSE, nreps * 2)),
                                      quantile_bins = c(rep(c(TRUE, FALSE), each = nreps * 5), 
                                                        rep(TRUE, nreps * 12)),
                                      nbins = c(rep(rep(2:6, each = nreps), 4), 
                                                rep(4, nreps * 2)),
                                      use_density = c(rep(TRUE, nreps * 21), 
                                                      rep(FALSE, nreps)),
                                      bin_method = c(rep("quantile_bins", nreps * 5),
                                                      rep("equally_spaced", nreps * 5),
                                                      rep("tree", nreps * 10),
                                                      rep("quantile_bins", nreps * 2)),
                                      response = c(rep(NA, nreps * 10),
                                                   rep("samesource", nreps * 5),
                                                   rep("rfscore", nreps * 5),
                                                   rep(NA, nreps * 2)))
  # Tell R to run the upcoming code in parallel
  plan(multiprocess)
  
  # Run lime for the sensitivity analysis and organize the output
  hamby224_sensitivity_outputs <- future_pmap(.l = hamby224_sensitivity_inputs,
             .f = run_lime, # run_lime is one of my helper functions
             features = rf_features,
             train = hamby173and252_train,
             test = hamby224_test %>% 
               arrange(case) %>% 
               select(rf_features) %>% 
               na.omit(),
             rfmodel = as_classifier(rtrees),
             label = "TRUE",
             nfeatures = 3,
             seed = FALSE) %>%
    map_df(function(list) list$explain) %>%
    mutate(rep = factor(rep(rep(1:nreps, each = dim(hamby224_test %>% na.omit())[1] * 3),
                            noptions)))
  
  # Turn the input options into a dataframe to be saved
  hamby224_sensitivity_inputs <- hamby224_sensitivity_inputs %>%
    unlist() %>%
    matrix(ncol = length(hamby224_sensitivity_inputs), 
           dimnames = list(NULL, names(hamby224_sensitivity_inputs))) %>%
    as.data.frame() %>%
    mutate(case = 1:(nreps * noptions)) %>%
    select(case, bin_continuous:bin_method)

  # Save the sensitivity inputs and outputs
  saveRDS(hamby224_sensitivity_inputs, "../data/hamby224_sensitivity_inputs.rds")
  saveRDS(hamby224_sensitivity_outputs, "../data/hamby224_sensitivity_outputs.rds")
  
} else {
  
  # Load in the sensitivity inputs and outputs
  hamby224_sensitivity_inputs <- readRDS("../data/hamby224_sensitivity_inputs.rds")
  hamby224_sensitivity_outputs <- readRDS("../data/hamby224_sensitivity_outputs.rds")

}
# Join the sensitivity outputs with the test data
hamby224_sensitivity_joined <- hamby224_sensitivity_outputs %>%
  full_join(hamby224_test %>% na.omit() %>% 
                mutate(case = as.character(case)), by = "case") %>%
    mutate(case = factor(case)) %>%
    select(case, model_r2:feature_weight, bin_continuous:rep, 
           set:land2, rfscore, samesource) %>%
    mutate(diff = rfscore - model_prediction,
           nbins = as.numeric(as.character(nbins)),
           situation = ifelse(bin_continuous == TRUE & bin_method == "quantile_bins", 
                              sprintf("%.0f quantile", nbins),
                              ifelse(bin_continuous == TRUE & bin_method == "equally_spaced",
                                     sprintf("%.0f equally spaced", nbins),
                                     ifelse(bin_continuous == TRUE & bin_method == "tree" &
                                              response == "samesource",
                                            sprintf("%.0f samesource tree", nbins),
                                            ifelse(bin_continuous == TRUE & bin_method == "tree" &
                                                     response == "rfscore",
                                                   sprintf("%.0f rfscore tree", nbins),
                                                   ifelse(bin_continuous == FALSE & 
                                                            use_density == TRUE, 
                                                          "kernel density", 
                                                          "normal approximation"))))),
           # fct_relevel must be applied to the situation variable (not the
           # data frame); levels not listed are kept at the end
           situation = fct_relevel(situation,
                                   "2 quantile", "3 quantile", "4 quantile",
                                   "5 quantile", "6 quantile", "2 equally spaced",
                                   "3 equally spaced", "4 equally spaced",
                                   "5 equally spaced", "6 equally spaced",
                                   "2 samesource tree", "3 samesource tree",
                                   "4 samesource tree", "5 samesource tree",
                                   "6 samesource tree")) %>%
    mutate(bin_situation = ifelse(bin_method == "quantile_bins" & 
                                  bin_continuous == TRUE,
                                  "quantile",
                                  ifelse(bin_method == "equally_spaced" & 
                                         bin_continuous == TRUE,
                                         "equally spaced", 
                                         ifelse(bin_method == "tree" & 
                                                bin_continuous == TRUE & 
                                                response == "samesource",
                                                "samesource tree", 
                                                ifelse(bin_method == "tree" & 
                                                       bin_continuous == TRUE & 
                                                       response == "rfscore",
                                                       "rfscore tree", 
                                                       ifelse(bin_continuous == FALSE & 
                                                       use_density == TRUE, 
                                                       "kernel density", 
                                                       "normal approximation")))))) %>%
    mutate(bin_situation = factor(bin_situation))
# Code for organizing the sensitivity analysis results
hamby224_sensitivity_results <- hamby224_sensitivity_joined %>%
  group_by(set, rep, situation, bin_situation, nbins) %>%
  summarise(mse = sum((diff)^2) / length(diff),
            ave_r2 = mean(model_r2))

The plot below shows boxplots of the mean squared errors computed for each of the input situations for each set. There does not appear to be much variation for most of the cases. The largest amount of variation occurs with 4 and 5 equally spaced bins. There is moderate variability for 3 equally spaced bins. The lowest MSEs occur for the samesource tree based bins followed by the rfscore tree based bins for most cases.

# Plots the MSEs from the sensitivity analysis
hamby224_sensitivity_results %>%
  mutate(bins = ifelse(bin_situation %in% c("kernel density", "normal approximation"),
                       "other", as.character(nbins))) %>%
  ggplot(aes(x = bin_situation, y = mse)) + 
  geom_boxplot(aes(color = bin_situation)) +
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") +
  labs(x = "Bin Type", y = "MSE", color = "") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

The plot below shows boxplots of the average \(R^2\) values from the 10 models for all input options and for each set. There is minimal variability in the average \(R^2\) value for all cases, and the relationships between \(R^2\) values and input options agree with those from the single implementation shown earlier.

# Plots the average R^2 values from the sensitivity analysis
hamby224_sensitivity_results %>%
  mutate(bins = ifelse(bin_situation %in% c("kernel density", "normal approximation"),
                       "other", as.character(nbins))) %>%
  ggplot(aes(x = bin_situation, y = ave_r2)) + 
  geom_boxplot(aes(color = bin_situation)) +
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") +
  labs(x = "Bin Type", y = "Average R2", color = "") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

To assess the consistency of the results, I determined the number of different features chosen by LIME as the top feature within a test case across the 10 replicates. I then computed the average number of different top features chosen by LIME within an input situation for each set. The plot below shows these values as dots. The error bars represent one standard deviation above and below the mean. Note that the number of levels cannot go below 1, but some error bars do. I have decided to ignore this for now, but I would need to come up with a better representation if I used this for something.

There are a handful of situations with a mean near one and very little variation. However, the largest mean is around 2, which is not very high. This seems to suggest that LIME is relatively consistent across runs.

# Determine and plot the average and variability of the number levels for each case 
# and input option
hamby224_consistency <- hamby224_sensitivity_joined %>%
  select(set, situation, bin_situation, nbins, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(set, situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(set, situation, nbins, case, rep) %>%
  slice(1) %>%
  ungroup() %>%
  group_by(set, situation, bin_situation, nbins, case) %>%
  summarise(nlevels = length(levels(factor(feature))),
            count = n()) %>%
  ungroup() %>%
  group_by(set, situation, bin_situation, nbins) %>%
  summarise(mean_nlevels = mean(nlevels),
            sd_nlevels = sd(nlevels),
            upper = mean(nlevels) + sd(nlevels),
            lower = mean(nlevels) - sd(nlevels),
            max = max(nlevels),
            min = min(nlevels)) %>%
  mutate(bins = ifelse(bin_situation %in% c("kernel density", "normal approximation"), 
                       "other", as.character(nbins)))

hamby224_consistency %>%
  ggplot(aes(x = situation, y = mean_nlevels)) + 
  geom_errorbar(aes(ymin = lower, ymax = upper), color = "grey80") +
  geom_point(aes(color = bin_situation)) + 
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") + 
  labs(x = "Bin Type", y = "Mean Number of Features", color = "") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") + 
  scale_color_gretchenalbrecht(palette = "winged_spill", discrete = TRUE)

The bar chart below shows the proportion of different numbers of first, second, and third features chosen across all test cases within a set for each input situation. This shows that the largest number of different features chosen was seven. However, many of the cases have only one or two different features chosen, and usually, a larger proportion of the cases have only one top feature chosen across the 10 reps.

# Determine the order that the features are chosen within a case and 
# set of input options
sensitivity_chosen <- hamby224_sensitivity_joined %>%
  select(set, situation, bin_situation, nbins, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(set, situation, case, rep, desc(feature_weight_abs)) %>%
  mutate(explainer = rep(c("first", "second", "third"), length(situation) / 3)) %>%
  select(-feature_weight, -feature_weight_abs) %>%
  spread(explainer, feature)
# Determine and plot the number of features chosen across the 10 reps within 
# a set, input options, and feature level
sensitivity_chosen %>%
  group_by(set, situation, case) %>%
  summarise(first = length(levels(factor(first))),
            second = length(levels(factor(second))),
            third = length(levels(factor(third)))) %>%
  gather(chosen, nfeatures, first:third) %>%
  ggplot(aes(x = situation, fill = factor(nfeatures))) + 
  geom_bar(position = "stack") + 
  facet_grid(set ~ chosen) + 
  labs(x = "Bin Type", y = "Count", fill = "Number of Features") + 
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "bottom") + 
  scale_fill_gretchenalbrecht(palette = "flood_tide", discrete = TRUE, reverse = FALSE)

I computed Fleiss' kappa for each sampling technique within a set across the 10 reps. These values are plotted below. Right now, the quantile methods and the rfscore tree based bins seem to be the most consistent. However, I am not happy with some of the kappa values: for some cases where the next set of plots would suggest a high kappa, the computed value is very low. I wonder if this is a flaw in Fleiss' kappa and whether I should use a different metric.

sensitivity_chosen %>%
  mutate(bins = ifelse(bin_situation %in% c("kernel density", "normal approximation"), 
                       "other", 
                       nbins)) %>%
  gather(key = chosen, value = feature, first:third) %>%
  spread(rep, feature) %>%
  rename(rep1 = "1", rep2 = "2", rep3 = "3", rep4 = "4", rep5 = "5", 
         rep6 = "6", rep7 = "7", rep8 = "8", rep9 = "9", rep10 = "10") %>%
  group_by(set, situation, bin_situation, bins, chosen) %>%
  summarise(kappa = 
              irr::kappam.fleiss(matrix(c(rep1, rep2, rep3, rep4, rep5,
                                          rep6, rep7, rep8, rep9, rep10), 
                                        ncol = 10))$value,
            nunique = length(unique(c(rep1, rep2, rep3, rep4, rep5, 
                                      rep6, rep7, rep8, rep9, rep10)))) %>%
  mutate(kappa = ifelse(is.nan(kappa), 1, kappa)) %>%
  ggplot(aes(x = bin_situation, y = kappa, color = chosen, group = chosen)) + 
  geom_point() + 
  geom_line(aes(linetype = chosen)) +
  facet_grid(set ~ bins, scales = "free_x", space = "free_x") +
  theme_bw() +
  theme(axis.text.x = element_text(angle = 45, hjust = 1)) + 
  scale_color_gretchenalbrecht(palette = "pink_cloud", discrete = TRUE) +
  labs(x = "Sampling Technique", y = "Kappa", color = "Feature", linetype = "Feature")

The plots below need some work, but they contain some interesting information. Right now, I am only concentrating on set 1. We can see that with some lime settings, all cases have the same “best” variable chosen, and other settings have multiple important variables that depend on the case. We can also see that different variables are selected as important depending on the number of bins. It is also interesting to note that when comparing the quantile bins to the equally spaced bins, this plot shows that the equally spaced bins tend to choose the same first variable for all cases. On the other hand, the quantile bins choose different first variables for the cases.

The fact that different features get chosen as the top feature for different estimation techniques makes sense. You would expect that the number of bins would determine which features are better suited for predicting whether or not a comparison is a match. For example, if you look back at the feature distribution plots, you can see that ccf is the obvious choice for determining between matches and non-matches if you have to choose two equally spaced bins. This is the feature that lime most frequently chooses. You can make similar arguments for other cases as well. Even though different inputs select different variables, I would expect that some variables will be better at prediction than others. I think this is what leads some input values to better LIME explanations and lower MSEs than others.

In the meeting with Heike, we talked about how this plot suggests that LIME is not doing a very good job of providing local explanations. Regardless of the case, it often suggests that only one or two variables are important, and the variable importance depends on the number of bins. In this plot, I would like consistency in the x-direction, but I would like variability in the y-direction to better understand the variables that are important for a specific case.

hamby224_sensitivity_joined %>%
  filter(set == "Set 1") %>%
  droplevels() %>%
  select(situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(situation, case, rep) %>%
  slice(1) %>%
  ungroup() %>%
  group_by(situation, case) %>%
  mutate(nlevels = length(levels(factor(feature)))) %>%
  ungroup() %>%
  arrange(situation, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(levels(case)), each = 10), 22)) %>%
  ggplot(aes(x = order, fill = feature)) + 
  geom_bar(position = "stack", width = 1) + 
  coord_flip() + 
  facet_wrap(situation ~ ., ncol = 5) +
  #facet_grid(nlevels ~ situation) +
  theme(legend.position = "bottom",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) +
  labs(x = "cases")

hamby224_sensitivity_joined %>%
  filter(set == "Set 1") %>%
  droplevels() %>%
  select(situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(situation, case, rep) %>%
  slice(2) %>%
  ungroup() %>%
  group_by(situation, case) %>%
  mutate(nlevels = length(levels(factor(feature)))) %>%
  ungroup() %>%
  arrange(situation, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(levels(case)), each = 10), 22)) %>%
  ggplot(aes(x = order, fill = feature)) + 
  geom_bar(position = "stack", width = 1) + 
  coord_flip() + 
  facet_wrap(situation ~ ., ncol = 5) +
  #facet_grid(nlevels ~ situation) +
  theme(legend.position = "bottom",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) +
  labs(x = "cases")

hamby224_sensitivity_joined %>%
  filter(set == "Set 1") %>%
  droplevels() %>%
  select(situation, case, rep, feature, feature_weight) %>%
  mutate(feature_weight_abs = abs(feature_weight)) %>%
  arrange(situation, case, rep, desc(feature_weight_abs)) %>%
  group_by(situation, case, rep) %>%
  slice(3) %>%   # keep the feature with the third-highest |weight| for each case/rep
  ungroup() %>%
  group_by(situation, case) %>%
  mutate(nlevels = length(levels(factor(feature)))) %>%
  ungroup() %>%
  arrange(situation, desc(nlevels), case) %>%
  mutate(order = rep(rep(1:length(levels(case)), each = 10), 22)) %>%
  ggplot(aes(x = order, fill = feature)) + 
  geom_bar(position = "stack", width = 1) + 
  coord_flip() + 
  facet_wrap(situation ~ ., ncol = 5) +
  #facet_grid(nlevels ~ situation) +
  theme(legend.position = "bottom",
        axis.text.y = element_blank(),
        axis.ticks.y = element_blank()) + 
  scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) +
  labs(x = "cases")
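The three chunks above differ only in the rank passed to `slice()`. As a sketch of how they could be collapsed into one helper (assuming the same `hamby224_sensitivity_joined` data frame and the packages loaded at the top of this journal; `plot_top_feature` is a hypothetical name):

```r
# Hypothetical helper: stacked bar plot of the rank-th highest-|weight|
# feature for each case/rep, faceted by situation (same logic as above)
plot_top_feature <- function(data, rank) {
  data %>%
    filter(set == "Set 1") %>%
    droplevels() %>%
    select(situation, case, rep, feature, feature_weight) %>%
    mutate(feature_weight_abs = abs(feature_weight)) %>%
    arrange(situation, case, rep, desc(feature_weight_abs)) %>%
    group_by(situation, case, rep) %>%
    slice(rank) %>%   # keep the rank-th feature per case/rep
    ungroup() %>%
    group_by(situation, case) %>%
    mutate(nlevels = length(levels(factor(feature)))) %>%
    ungroup() %>%
    arrange(situation, desc(nlevels), case) %>%
    mutate(order = rep(rep(1:length(levels(case)), each = 10), 22)) %>%
    ggplot(aes(x = order, fill = feature)) +
    geom_bar(position = "stack", width = 1) +
    coord_flip() +
    facet_wrap(situation ~ ., ncol = 5) +
    theme(legend.position = "bottom",
          axis.text.y = element_blank(),
          axis.ticks.y = element_blank()) +
    scale_fill_gretchenalbrecht(palette = "last_rays", discrete = TRUE) +
    labs(x = "cases")
}

# One plot per rank of interest
map(1:3, ~ plot_top_feature(hamby224_sensitivity_joined, rank = .x))
```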

I was thinking a bit more about the relationship between the top features chosen and the type of binning used, and I realized it would be helpful to look at how the features relate to the random forest score. The plots below show the random forest score versus each of the features for the testing data, with points colored by the samesource variable.

hamby224_test %>% 
  select(all_of(rf_features), rfscore, samesource) %>%   # all_of() avoids the ambiguity of a bare external vector
  gather(key = feature, value = value, all_of(rf_features)) %>%
  select(feature, value, rfscore, samesource) %>%
  ggplot(aes(x = value, y = rfscore, color = samesource)) + 
  geom_point() + 
  facet_wrap( ~ feature, scales = "free") +
  labs(x = "Variable Value", y = "Random Forest Score", color = "Same Source?",
       title = "Hamby 224 Testing Data") +
  theme_bw() +
  theme(legend.position = "bottom")

Input Rankings

In order to decide on which LIME settings to use for the firearm examiners’ paper, I decided to create a table that ranks each of the 12 input settings by MSE from the single implementation, the average MSE across the 10 implementations, the average number of top features chosen across the 10 implementations, and the average \(R^2\) value from the single implementation. The first table below is for set 1, and the second table below is for set 11.

While using these to help me decide which input settings to use, I had a handful of thoughts:

  • Since I am still working on making sense of the variation results, and it would be nice to get the first paper out, I am thinking of basing the choice of input values entirely on the assessment measures from the single implementation.
  • Even though I need more time to think about how to interpret the results from the 10 reps, the mean MSEs across the reps are very similar to the MSEs from the single rep, and the top chosen features do not appear to vary much between reps. This makes me think that it is okay to base my decision on the results from the single rep.
  • Neither Heike nor I think that the \(R^2\) is a very helpful measure for determining whether the LIME explanations are good, so I am not going to put much weight on it.
  • Additionally, I do not want to consider either the normal approximation or the kernel density estimate, since I have not figured out a good way to visualize these in the app. Right now, the feature plots from these explanations are not very helpful; I would need to adjust for the intercept or something.

Based on the MSEs from the single implementation of LIME, I will use 3 equally spaced bins for both the set 1 and set 11 explanations in the firearm examiner paper since it leads to the lowest MSE for both sets.
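In the R lime package, this choice maps onto the binning arguments of `lime()`. A minimal sketch of the chosen setting, assuming the training data and the `rtrees` random forest loaded at the top of this journal:

```r
# Sketch: explainer with 3 equally spaced bins (the chosen setting).
# quantile_bins = FALSE gives equal-width rather than quantile bins.
explainer <- lime::lime(
  x = hamby173and252_train %>% select(all_of(rf_features)),
  model = rtrees,
  bin_continuous = TRUE,
  quantile_bins = FALSE,
  n_bins = 3
)
```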

# Join the single-rep and sensitivity assessment measures, then rank the input settings within each set
joined_results <- hamby224_lime_results %>%
  select(set, situation, mse, ave_r2) %>%
  full_join(hamby224_sensitivity_results %>%
              group_by(set, situation) %>%
              summarise(mean_sensitivity_mse = mean(mse), 
                        sd_sensitivity_mse = sd(mse)),
            by = c("set", "situation")) %>%
  full_join(hamby224_consistency %>% select(set, situation, mean_nlevels, sd_nlevels), 
            by = c("set", "situation")) %>%
  group_by(set) %>%
  mutate(mse_order = dense_rank(mse),
         ave_r2_order = dense_rank(desc(ave_r2)),
         mean_sensitivity_mse_order = dense_rank(mean_sensitivity_mse),
         mean_nlevels_order = dense_rank(mean_nlevels))
joined_results %>%
  ungroup() %>%
  filter(set == "Set 1") %>%
  select(situation, mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  arrange(mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  DT::datatable(options = list(pageLength = 12, paging = FALSE, searching = FALSE),
                rownames = FALSE,
                colnames = c("Situation", "MSE Rank", "Mean MSE Rank", 
                             "Mean Number of Top Features Rank", "Mean R2 Rank"),
                caption = "Results from Set 1")
joined_results %>%
  ungroup() %>%
  filter(set == "Set 11") %>%
  select(situation, mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  arrange(mse_order, mean_sensitivity_mse_order, mean_nlevels_order, ave_r2_order) %>%
  DT::datatable(options = list(pageLength = 12, paging = FALSE, searching = FALSE),
                rownames = FALSE,
                colnames = c("Situation", "MSE Rank", "Mean MSE Rank", 
                             "Mean Number of Top Features Rank", "Mean R2 Rank"),
                caption = "Results from Set 11")